Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ability to use regex rules #133

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

cognitivegears
Copy link
Contributor

Please check if the PR fulfills these requirements

  • [ X ] The commit message follows our guidelines
  • [ X ] Tests for the changes have been added (for bug fixes / features)
  • [ X ] Docs have been added / updated (for bug fixes / features) see docs/README.md

What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
New feature

What is the current behavior? (You can also link to an open issue here)
Currently there is no way to specify a regular expression in a rule. This means that some rules are difficult to create, for example:

  • Smart word boundaries on multi word terms, such as "Black List" matching even with a tab or other charcter (e.g. "black\s*list")
  • A rule that matches "master" without word boundaries in order to match both master and words composed with master (i.e. IsMasterRepo), but excluding specific words, like "masters". (e.g. "master(?:\b|[^s])")
  • Many others... With regular expressions, flexible rules can be created to match many different cases

What is the new behavior (if this is a feature change)?
Added a new option "regex_terms" on the rules to specify that they should be treated as regular expressions. If set to yes, then terms are not escaped before adding to the terms list and errors displayed / rules set to ignore from the regular expression compile on error.

Does this PR introduce a breaking change? (What changes might users need to make due to this PR?)
No

Other information:
N/A

@codecov
Copy link

codecov bot commented Aug 12, 2021

Codecov Report

Merging #133 (e3fdd00) into main (1fe097d) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
+ Coverage   95.90%   95.92%   +0.01%     
==========================================
  Files          21       21              
  Lines         708      711       +3     
==========================================
+ Hits          679      682       +3     
  Misses         18       18              
  Partials       11       11              
Impacted Files Coverage Δ
pkg/rule/rule.go 94.54% <100.00%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fe097d...e3fdd00. Read the comment docs.

@caitlinelfring
Copy link
Member

I've been wanting this feature, but the schema I had in mind was something like this:

rules:
  - name: foo
    terms:
      - foo
      - Foo
  - name: bar
    regex: "(?i)bar"

# when both are defined, `terms` wins, `regex` would be ignored, maybe log debug message that both are defined?
  - name: baz
    terms:
      - baz
      - BaZ
    regex: "(?i)baz"

docs/rules.md Outdated

* If `true`, terms will be evaluated as regular expressions
* If `false`, terms will be treated as plain-text values
* **NOTE** this is an advanced feature. Rules will be skipped if they do not compile. Only use non-capturing groups in patterns. Look-around assertions are not supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a fatal error should happen if a supplied regex can't compile. Otherwise the user may never know that their rule isn't being used.

Copy link
Contributor Author

@cognitivegears cognitivegears Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - I added a Error log message, so when it runs if a rule won't compile, an error is generated but it continues, it looks like this:

2021-08-13T09:03:27-05:00 ERR Unable to compile regular expression, disabling rule error="error parsing regexp: missing closing ): `(?i)(black(\\s*list)`" Rule=rulename

I thought about letting it do a fatal error instead, though I chose this approach in anticipation of the remote ruleset feature - my thought was that if there is a rule that doesn't work in a remote ruleset that the user can't necessarily directly change, it could continue on after printing the error. That said, I'm totally happy to go either way with this - if you want it a fatal error that's easy to change, I can just change it back to a MustCompile instead, or doing a panic after logging the error so that they still get that debug info from that log. What do you prefer here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my thought was that if there is a rule that doesn't work in a remote ruleset that the user can't necessarily directly change

I think we should tackle this when it becomes a problem or concern.

@cognitivegears
Copy link
Contributor Author

I've been wanting this feature, but the schema I had in mind was something like this:

rules:
  - name: foo
    terms:
      - foo
      - Foo
  - name: bar
    regex: "(?i)bar"

# when both are defined, `terms` wins, `regex` would be ignored, maybe log debug message that both are defined?
  - name: baz
    terms:
      - baz
      - BaZ
    regex: "(?i)baz"

No problem, should be pretty straightforward. I'll change this.

@cognitivegears
Copy link
Contributor Author

I've been wanting this feature, but the schema I had in mind was something like this:

rules:
  - name: foo
    terms:
      - foo
      - Foo
  - name: bar
    regex: "(?i)bar"

# when both are defined, `terms` wins, `regex` would be ignored, maybe log debug message that both are defined?
  - name: baz
    terms:
      - baz
      - BaZ
    regex: "(?i)baz"

No problem, should be pretty straightforward. I'll change this.

Oh - one more question. I see the (?i) in your example. I assume that means that you are intending that this wouldn't do the wrapping at all that it does for terms, so word_boundary, word_boundary_start, and word_boundary_end would also be ignored? That makes sense, but in that case should that also then put a debug message if any of those are defined just to make it clear what is happening?

@caitlinelfring
Copy link
Member

I assume that means that you are intending that this wouldn't do the wrapping at all that it does for terms, so word_boundary, word_boundary_start, and word_boundary_end would also be ignored? That makes sense, but in that case should that also then put a debug message if any of those are defined just to make it clear what is happening?

That's correct, if you supply regex, no other options that are used to generate the regex internally would be used. regex should be a "use at your own risk" config setting.

@cognitivegears
Copy link
Contributor Author

Sorry for the delay - updated the code to follow the suggested changes above.

@cognitivegears
Copy link
Contributor Author

I see that the benchmark tool is failing. I'm not sure why that would be, however, unless there is some instability in the benchmarking process itself. This change should not represent any runtime change in performance, and only a single additional conditional when compiling rules. If I'm missing something there though please let me know so that I can help resolve.

@caitlinelfring
Copy link
Member

I see that the benchmark tool is failing. I'm not sure why that would be, however, unless there is some instability in the benchmarking process itself. This change should not represent any runtime change in performance, and only a single additional conditional when compiling rules. If I'm missing something there though please let me know so that I can help resolve.

Apologies for not responding for so long! The benchmark test is failing because it's supposed to create a comment on the PR with the results, but the github action that handles the PR commenting doesn't work on forks. I'm disabling that part in #147.

Aside from that, looking at the output of the benchmark tool (which just runs go test -bench on the branch and compares it to the results on main), seems like BenchmarkRootRunE has higher ns/op. Would you be able to look into why?

@Puneethgr
Copy link

Puneethgr commented Feb 21, 2023

@cognitivegears Please let me know if there are any updates. This is a nice feature.

@simskij
Copy link

simskij commented Mar 27, 2023

@caitlinelfring we're using this quite extensively at Canonical, and really need this feature. anything i could assist you with to get this through?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants